Efficiently Scaling Transformer Inference

Pope, Reiner; Douglas, Sholto; Chowdhery, Aakanksha; Devlin, Jacob; Bradbury, James; Levskaya, Anselm; Heek, Jonathan; Xiao, Kefan; Agrawal, Shivani; Dean, Jeff

Computer Science > Machine Learning

arXiv:2211.05102 (cs)

[Submitted on 9 Nov 2022]

Abstract: We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. A better understanding of the engineering tradeoffs for inference for large Transformer-based models is important as use cases of these models are growing rapidly throughout application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices based on the application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that with appropriate partitioning, the lower memory requirements of multiquery attention (i.e., multiple query heads share a single key/value head) enable scaling up to 32x larger context lengths. Finally, we achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens, while supporting a long 2048-token context length on the PaLM 540B parameter model.
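The memory argument for multiquery attention can be illustrated with a rough key/value cache estimate. The sketch below is a minimal illustration only: the function name `kv_cache_bytes` and every configuration value (layer count, head count, head dimension, batch size, 2-byte elements) are hypothetical and are not taken from the paper. It counts only the raw cache size and ignores how the cache is partitioned across chips, which the abstract notes also matters for the achievable context length.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, bytes_per_elem=2):
    """Rough size of the attention key/value cache in bytes.

    Each layer stores one key tensor and one value tensor (hence the factor 2),
    each of shape [batch, seq_len, n_kv_heads, d_head].
    """
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem


# Hypothetical configuration, chosen only for illustration.
N_LAYERS, N_HEADS, D_HEAD = 64, 32, 128
SEQ_LEN, BATCH = 2048, 8

# Multihead attention: every query head has its own key/value head.
multihead = kv_cache_bytes(N_LAYERS, N_HEADS, D_HEAD, SEQ_LEN, BATCH)

# Multiquery attention: all query heads share a single key/value head.
multiquery = kv_cache_bytes(N_LAYERS, 1, D_HEAD, SEQ_LEN, BATCH)

print(f"multihead : {multihead / 2**30:.2f} GiB")
print(f"multiquery: {multiquery / 2**30:.2f} GiB")
print(f"reduction : {multihead // multiquery}x")
```

Under these assumed values the cache shrinks by a factor equal to the number of query heads, so for a fixed memory budget the saved space could in principle go toward a longer context or a larger batch.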

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2211.05102 [cs.LG]
	(or arXiv:2211.05102v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2211.05102

Submission history

From: Aakanksha Chowdhery
[v1] Wed, 9 Nov 2022 18:50:38 UTC (645 KB)
